A Novel Technique for Spare Web Page Detection in Parallel Web Crawler
Similar Resources
Focused Web Crawler with Page Change Detection Policy
Focused crawlers aim to search only the subset of the web related to a specific topic, and offer a potential solution to the problem. The major challenge is how to retrieve the maximal set of relevant, high-quality pages. In this paper, we propose an architecture that concentrates on the page selection policy and the page revisit policy. The three-step algorithm for page refreshment serves the purpos...
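As a rough illustration of what such a page revisit policy might look like, here is a minimal Python sketch that detects changes by hashing the page content and adapts the revisit interval accordingly. The hash comparison and the halve/double rule are illustrative assumptions, not the paper's actual three-step refreshment algorithm.

```python
import hashlib
import urllib.request

def fetch(url):
    """Download a page and return its body as bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def revisit(url, last_hash, interval, min_iv=3600, max_iv=7 * 86400):
    """Re-crawl a page and adapt its revisit interval (in seconds).

    If the content hash changed, visit more often (halve the interval);
    if it is unchanged, back off (double the interval).
    """
    page_hash = hashlib.sha256(fetch(url)).hexdigest()
    if page_hash != last_hash:
        interval = max(min_iv, interval // 2)   # page changed: revisit sooner
    else:
        interval = min(max_iv, interval * 2)    # page stable: revisit later
    return page_hash, interval
```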
A Novel Architecture of a Parallel Web Crawler
Due to the explosion in the size of the WWW [1,4,5], it has become essential to make the crawling process parallel. In this paper we present an architecture for a parallel crawler that consists of multiple crawling processes, called C-procs, which can run on a network of workstations. The proposed crawler is scalable and resilient against system crashes and other events. The aim of this architecture i...
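A minimal sketch of the core idea, assuming URLs are partitioned among the C-procs by hashing the host name so that no two processes crawl the same site; the queue-per-process layout is an assumption of this sketch, not necessarily the paper's coordination scheme.

```python
import multiprocessing as mp
from urllib.parse import urlparse

def c_proc(proc_id, queue):
    """A single crawling process (C-proc). It consumes the URLs assigned
    to it; a real C-proc would fetch and parse each page here."""
    while True:
        url = queue.get()
        if url is None:              # poison pill: shut down cleanly
            break
        print(f"C-proc {proc_id} crawls {url}")

def dispatch(url, queues):
    """Partition the Web among C-procs by hashing the host name, so two
    processes never crawl the same site and duplicate work."""
    queues[hash(urlparse(url).netloc) % len(queues)].put(url)

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(4)]
    procs = [mp.Process(target=c_proc, args=(i, q)) for i, q in enumerate(queues)]
    for p in procs:
        p.start()
    for url in ["http://example.com/", "http://example.org/a", "http://example.net/b"]:
        dispatch(url, queues)
    for q in queues:
        q.put(None)                  # tell every C-proc to stop
    for p in procs:
        p.join()
```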
A Novel Architecture for Deep Web Crawler
A traditional crawler picks up a URL, retrieves the corresponding page and extracts various links, adding them to the queue. A deep Web crawler, after adding links to the queue, checks for forms. If forms are present, it processes them and retrieves the required information. Various techniques have been proposed for crawling deep Web information, but much remains undiscovered. In this paper, th...
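The loop described above is concrete enough to sketch: pop a URL, fetch the page, enqueue its links, then check for forms. The parser below is an assumption built on Python's standard html.parser, and the form processing itself is left as a stub.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkAndFormParser(HTMLParser):
    """Collects outgoing links and any <form> elements on the page."""
    def __init__(self):
        super().__init__()
        self.links, self.forms = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append(attrs)    # keep action/method for later processing

def crawl(seed, limit=10):
    queue, seen = deque([seed]), {seed}
    while queue and limit > 0:
        url = queue.popleft()
        limit -= 1
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        parser = LinkAndFormParser()
        parser.feed(html)
        for link in parser.links:       # traditional step: extend the queue
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        for form in parser.forms:       # deep-Web step: forms would be filled
            print("form found on", url, "->", form.get("action"))  # and submitted here
```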
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach, multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
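One common way to realize this idea, sketched here under the assumption of a term co-occurrence graph (the paper's exact graph construction and weighting may differ): terms that appear in the same document are linked, PageRank is run over that graph, and the top-ranked terms are kept as the feature subset.

```python
from collections import defaultdict
from itertools import combinations

def pagerank_features(docs, k=10, d=0.85, iters=50):
    """Rank terms with PageRank over a co-occurrence graph built from
    `docs`, a list of tokenized documents; return the top-k terms."""
    weight = defaultdict(float)   # edge weights between co-occurring terms
    out = defaultdict(float)      # total outgoing weight per term
    nodes = set()
    for doc in docs:
        terms = set(doc)
        nodes |= terms
        for a, b in combinations(sorted(terms), 2):
            weight[(a, b)] += 1.0
            weight[(b, a)] += 1.0
            out[a] += 1.0
            out[b] += 1.0
    rank = {t: 1.0 / len(nodes) for t in nodes}
    for _ in range(iters):        # standard power-iteration PageRank
        new = {t: (1 - d) / len(nodes) for t in nodes}
        for (a, b), w in weight.items():
            new[b] += d * rank[a] * w / out[a]
        rank = new
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

The PageRank score then doubles as the feature weight, so the same pass yields both the selected subset and its weighting.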
Towards a Content-Provider-Friendly Web Page Crawler
Search engine quality is impacted by two factors: the quality of the ranking/matching algorithm used, and the freshness of the search engine's index, which maintains a “snapshot” of the Web. Web crawlers capture web pages and refresh the index, but this is a never-ending quest, as web pages are updated frequently (and thus have to be re-crawled). Knowing when to re-crawl a web page is fun...
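Deciding when to re-crawl is often cast as estimating each page's change rate from its crawl history. The sketch below assumes a Poisson change model and a smoothed change-probability estimate; both are conventional modeling choices rather than this paper's specific method.

```python
import math

def next_recrawl_interval(checks, changes, base_interval, target_freshness=0.9):
    """Pick the next re-crawl delay (in seconds) for a page that was found
    changed on `changes` of `checks` visits spaced `base_interval` apart."""
    # Smoothed fraction of visits on which the page had changed; stays in (0, 1).
    ratio = (changes + 0.5) / (checks + 1.0)
    # Under a Poisson model, P(change within base_interval) = 1 - exp(-rate * base_interval).
    rate = -math.log(1.0 - ratio) / base_interval
    # Choose t so that P(still fresh after t) = exp(-rate * t) >= target_freshness.
    return -math.log(target_freshness) / rate
```

For example, a page that changed on 3 of its last 10 daily visits gets next_recrawl_interval(10, 3, 86400), roughly a quarter of a day until the next visit.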
Journal
Journal title: International Journal of Computer Applications
Year: 2014
ISSN: 0975-8887
DOI: 10.5120/16392-6009